
fix: support non-F32 quantized types in CUDA concat op #4

Open
cdome94 wants to merge 2 commits into antirez:main from cdome94:main

Conversation


cdome94 commented May 4, 2026

Overview

`ggml_cuda_op_concat` crashed with `GGML_ASSERT(src0->type == GGML_TYPE_F32)` when running DeepSeek V4 Flash quantized GGUF models on NVIDIA CUDA, making it impossible to use `-ngl` on any NVIDIA GPU.

Root causes:

  • Three hard assertions requiring the F32 type blocked any quantized input
  • Float offset calculations used a hardcoded `/ 4` (`sizeof(float)`) instead of `ggml_nbytes()`

Fix:

  • Removed the F32-only assertions and replaced them with a `src0->type == dst->type` consistency check
  • Added a byte-level `cudaMemcpy` path for contiguous quantized tensors along dims 1/2/3
  • The F32 path is left entirely unchanged
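
For reference, the core idea can be sketched roughly as follows. This is a simplified illustration, not the actual patch: the function name and structure are made up for the example, error checking is omitted, and it assumes contiguous tensors whose data pointers live on the device.

#include <cuda_runtime.h>
#include "ggml.h"

// Sketch only: byte-level concat of two contiguous, same-type (possibly
// quantized) tensors along dim 1, 2 or 3. Dim 0 (the quantized row) is never
// split, so whole rows can be copied as opaque bytes.
static void concat_bytes_contiguous(const ggml_tensor * src0,
                                    const ggml_tensor * src1,
                                    ggml_tensor * dst,
                                    int dim, cudaStream_t stream) {
    GGML_ASSERT(dim >= 1 && dim <= 3);
    GGML_ASSERT(src0->type == dst->type && src1->type == dst->type);
    GGML_ASSERT(ggml_is_contiguous(src0) && ggml_is_contiguous(src1) && ggml_is_contiguous(dst));

    // byte size of one contiguous slice spanning dims 0..dim
    const size_t slice0 = src0->nb[dim] * src0->ne[dim];
    const size_t slice1 = src1->nb[dim] * src1->ne[dim];

    // number of such slices = product of the dims above `dim` (1 when dim == 3)
    int64_t n_outer = 1;
    for (int d = dim + 1; d < 4; ++d) {
        n_outer *= src0->ne[d];
    }

    const char * p0 = (const char *) src0->data;
    const char * p1 = (const char *) src1->data;
    char       * pd = (char       *) dst->data;

    for (int64_t i = 0; i < n_outer; ++i) {
        // each dst slice is the src0 slice followed by the src1 slice
        cudaMemcpyAsync(pd,          p0, slice0, cudaMemcpyDeviceToDevice, stream);
        cudaMemcpyAsync(pd + slice0, p1, slice1, cudaMemcpyDeviceToDevice, stream);
        p0 += slice0;
        p1 += slice1;
        pd += slice0 + slice1;
    }
}

The key point is that the concat dim is never dim 0, so quantized blocks stay intact and the copy can treat the data as raw bytes; the existing F32 kernel path is not touched.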

Tested on: NVIDIA GB10 (122 GB unified memory), DeepSeek V4 Flash IQ2XXS-w2Q2K-AProjQ8-SExpQ8 quantization.

Additional information

Related discussion: ggml-org/llama.cpp#22376

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES — the fix was developed with AI assistance (Claude) for code analysis and patch generation. The patch was manually reviewed, tested, and verified working on hardware by the submitter.

Remove hardcoded F32 assertions in ggml_cuda_op_concat.
Add byte-level cudaMemcpy path for contiguous quantized tensors
(dim 1/2/3). Fix hardcoded /4 float offset to use ggml_nbytes().
Enables running DeepSeek V4 Flash quantized GGUF on NVIDIA CUDA.

emcalv commented May 6, 2026

Hello @cdome94, thank you for the patch! Downloaded, tested and it's working, but I'm getting 1-2 t/s on the same HW as yours; may I ask what parameters you are passing to llama? Thank you!

cdome94 (Author) commented May 7, 2026

> Hello @cdome94, thank you for the patch! Downloaded, tested and it's working, but I'm getting 1-2 t/s on the same HW as yours; may I ask what parameters you are passing to llama? Thank you!

Hi! Glad it's working for you.

Here are the parameters I'm using:

./build/bin/llama-server \
  -m DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf \
  -ngl 999 \
  -c 65536 \
  --host 0.0.0.0 \
  --port 8080

I'm getting around 8-12 t/s on the GB10 with this setup. A few things that might affect speed:

  • Make sure all layers are offloaded to GPU (`-ngl 999`) and no layers are falling back to CPU
  • The GB10 uses unified memory so there's no CPU↔GPU transfer overhead — if you're on a discrete GPU setup the bandwidth characteristics will be different
  • Context length has a significant impact: I was initially running `-c 131072`, which halved my throughput compared to `-c 65536`
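
If it helps to narrow things down, llama-bench from the same build directory can report prompt processing and generation speed separately; for example (the flags below are just a suggestion, adjust the model path and values to your setup):

./build/bin/llama-bench \
  -m DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf \
  -ngl 999 \
  -p 512 -n 128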

What context length are you using?


emcalv commented May 7, 2026


Hey, thanks for the reply - I tried decreasing the context and it's all on GPU, but still no luck, not many tokens/s; the setup is the same, a DGX Spark.
